Data Warehouse Modeling and Quality Issues
Abstract
Data warehouses can be defined as subject-oriented, integrated, time-varying, non-volatile collections of data that are used primarily in organizational decision making. Data warehousing has become an important strategy for integrating heterogeneous information sources in organizations and for enabling On-Line Analytic Processing (OLAP). Unfortunately, neither the accumulation nor the storage process seems to be completely credible. For example, it has been suggested in the literature that more than $2 billion of U.S. federal loan money has been lost because of poor data quality at a single agency, and that manufacturing companies spent over 25% of their sales on wasteful practices, a figure that rose to 40% for service companies. The question that arises, then, is how to organize the design, administration and evolution choices in such a way that all the different, and sometimes opposing, user quality requirements can be simultaneously satisfied. To tackle this problem, this thesis makes the following contributions. The first major result that we present is a general framework for the treatment of data warehouse metadata in a metadata repository. The framework requires the classification of metadata in at least two instantiation layers and three perspectives. The metamodel layer constitutes the schema of the metadata repository, and the metadata layer holds the actual meta-information for a particular data warehouse. The perspectives are the well-known conceptual, logical and physical perspectives from the field of database and information systems. We link this framework to a well-defined approach for the architecture of the data warehouse from the literature. Then, we present our proposal for a quality metamodel, which builds on the widely accepted Goal-Question-Metric (GQM) approach for the quality management of information systems.
Moreover, we enrich the generic metamodel layer with patterns concerning the linkage (a) of quality metrics to data warehouse objects and (b) of data warehouse stakeholders to template quality goals. Then, we go on to describe a metamodel for data warehouse operational processes. This metamodel enables data warehouse management, design and evolution based on a high-level conceptual perspective, which can be linked to the actual structural and physical aspects of the data warehouse architecture. This metamodel is capable of modeling complex activities, their interrelationships, the relationship of activities with data sources, and execution details. The ex ante treatment of the metadata repository is enabled by a full set of steps, i.e., quality questions, which constitute our methodology for data warehouse quality management and the quality-oriented evolution of a data warehouse based on the architecture, process and quality metamodels. Our approach extends GQM, based on the idea that a goal is operationally defined over a set of questions. Thus, we provide specific "questions" for the full lifecycle of a goal: this way the data warehouse metadata repository is not simply defined statically, but can actually be exploited in a systematic manner. Special attention is paid to a particular part of the architecture metamodel, the modeling of OLAP databases. To this end, we first provide a categorization of the work in the area of OLAP logical models by surveying some major efforts, including commercial tools, benchmarks and standards, and academic efforts. We also attempt a comparison of the various models along several dimensions, including representation and querying aspects. Our contribution lies in the introduction of a logical model for cubes based on the key observation that a cube is not a self-existing entity, but rather a view over an underlying data set.
The proposed model is powerful enough to capture all the commonly encountered OLAP operations, such as selection, roll-up and drill-down, through a sound and complete algebra. We also show how this model can be used as the basis for processing cube operations and provide syntactic characterizations for the problems of cube usability. Finally, this thesis gives an extended review of the existing literature in the field, as well as a list of related open research issues.
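The observation that a cube is a view over an underlying data set can be illustrated with a minimal sketch. This is not the algebra proposed in the thesis; the sample rows, the dimension hierarchies, and the names `SALES`, `LEVEL_OF` and `cube` are all hypothetical and chosen only for illustration. Each cube is recomputed from the detailed data, so selection, roll-up and drill-down differ only in the predicate and the aggregation levels passed in.

```python
from collections import defaultdict

# Hypothetical detailed data set: one row per sale at the finest granularity.
SALES = [
    {"day": "2024-01-05", "city": "Athens", "amount": 10},
    {"day": "2024-01-20", "city": "Athens", "amount": 5},
    {"day": "2024-02-02", "city": "Patras", "amount": 7},
    {"day": "2024-02-10", "city": "Athens", "amount": 3},
]

# Hypothetical hierarchy of the city dimension (city -> country).
COUNTRY_OF = {"Athens": "Greece", "Patras": "Greece"}

# (dimension, level) -> function mapping a detailed value to that level.
LEVEL_OF = {
    ("day", "day"): lambda d: d,
    ("day", "month"): lambda d: d[:7],  # "YYYY-MM-DD" -> "YYYY-MM"
    ("city", "city"): lambda c: c,
    ("city", "country"): lambda c: COUNTRY_OF[c],
}


def cube(rows, levels, measure, predicate=lambda row: True):
    """Evaluate a cube as an aggregate view over the detailed rows.

    levels maps each dimension to the level at which it is aggregated;
    predicate is a selection condition applied to the detailed rows.
    """
    totals = defaultdict(int)
    for row in rows:
        if not predicate(row):
            continue
        key = tuple(LEVEL_OF[(dim, lvl)](row[dim]) for dim, lvl in levels.items())
        totals[key] += row[measure]
    return dict(totals)


# Finest-grained cube; drilling down from any coarser cube re-evaluates this view.
by_day = cube(SALES, {"day": "day", "city": "city"}, "amount")

# Roll-up along the day dimension: day -> month.
by_month = cube(SALES, {"day": "month", "city": "city"}, "amount")

# Selection combined with the roll-up: only sales made in Athens.
athens_by_month = cube(
    SALES,
    {"day": "month", "city": "city"},
    "amount",
    predicate=lambda row: row["city"] == "Athens",
)
```

Because every cube is defined over the same detailed rows, a drill-down never needs information that a coarser cube has already aggregated away, which is one practical consequence of treating cubes as views.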
Similar resources
A Data Warehouse Architecture for Clinical Data Warehousing
Data warehousing methodologies share a common set of tasks, including business requirements analysis, data design, architectural design, implementation and deployment. Clinical data warehouses are complex, and reviewing a series of patient records in them is time-consuming; however, they are among the efficient data repositories available for delivering quality patient care. Data integration tasks of medical data sto...
A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment
In today's scenario, extraction-transformation-loading (ETL) tools have become important pieces of software responsible for integrating heterogeneous information from several sources. Carrying out the ETL process is potentially complex, hard and time-consuming. Organisations nowadays are concerned about vast qualities of data. The data quality is concerned with technical issue...
An Exploratory Investigation of System Success Factors in Data Warehousing
Despite the increasing role of the data warehouse as a strategic information source for decision makers, academic research has been lacking, especially from an organizational perspective. An exploratory study was conducted to improve general understanding of data warehousing issues from the perspective of IS success. For this, the effect of variables pertaining to system quality, information qu...
Data warehouse process management
Previous research has provided metadata models that enable the capturing of the static components of a data warehouse architecture, along with information on different quality factors over these components. This paper complements this work with the modeling of the dynamic parts of the data warehouse. The proposed metamodel of data warehouse operational processes is capable of modeling complex...
Data Warehouse for EIS: Some Issues and Impacts
Data warehouse is one of the most rapidly growing areas in management information systems. With this approach, data for EIS and DSS applications is separated from operational data and stored in a separate database called a data warehouse. Some of the advantages of this approach are improved performance, better data quality, and the ability to consolidate and summarize data from heterogeneous le...
A Data Warehouse Architecture for Brazilian Science and Technology Environment
Science and technology in Brazil are areas that have few available resources and many times these scarce resources are badly used. The data warehouse is a tool that can make possible a better distribution of these resources. In this article are considered some issues in the development of a data warehouse for Science & Technology management. The paper describes the necessity of a supporting sys...